Simple B&B restart #1415

Draft
nguidotti wants to merge 29 commits into NVIDIA:main from nguidotti:simple-restart

@nguidotti nguidotti commented Jun 10, 2026

Contributor

This PR implements a simple restart procedure for B&B. More specifically:

  1. Check whether solver progress has stagnated, using the gap and the tree size estimate derived from the tree weight.
  2. If the solver stagnates N consecutive times, trigger a restart.
  3. Stop all B&B tasks (best-first, diving, and RINS).
  4. Apply reduced cost fixing if the upper bound has changed since the last restart.
  5. Clean the previous state (tree, workers), but carry over the pseudocosts.
  6. Start the B&B procedure again.
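Steps 1 and 2 can be pictured with a minimal sketch. It is not the PR's `should_restart()`: the struct name, the thresholds, and the stagnation rule (small relative gap reduction combined with small tree-weight gain) are all assumptions chosen for illustration, whereas the PR bases its decision on a tree size estimate.

```cpp
#include <cassert>

// Hypothetical stagnation monitor; names and thresholds are illustrative only.
struct restart_monitor_t {
  int max_stagnant_checks;   // N consecutive stagnant checks trigger a restart
  double min_gap_reduction;  // relative gap reduction below this counts as stagnation
  double min_progress_gain;  // tree-weight gain below this counts as stagnation
  double last_gap;           // absolute gap at the previous check
  double last_progress = 0.0;
  int stagnant_checks  = 0;

  // Returns true when a restart should be requested.
  bool check(double abs_gap, double progress)
  {
    assert(abs_gap >= 0.0 && progress >= 0.0 && progress <= 1.0);
    const double gap_reduction = last_gap > 0.0 ? (last_gap - abs_gap) / last_gap : 0.0;
    const double progress_gain = progress - last_progress;
    const bool stagnant =
      gap_reduction < min_gap_reduction && progress_gain < min_progress_gain;

    // Any non-stagnant check resets the streak.
    stagnant_checks = stagnant ? stagnant_checks + 1 : 0;
    last_gap        = abs_gap;
    last_progress   = progress;

    if (stagnant_checks >= max_stagnant_checks) {
      stagnant_checks = 0;  // start counting afresh after the restart
      return true;
    }
    return false;
  }
};
```

With `restart_monitor_t m{3, 0.01, 0.01, 10.0}`, three consecutive checks that barely move the gap or the tree weight make the third call return true, while any check with real progress resets the streak.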

References

[1] G. Hendel, “Adaptive solver behavior in mixed-integer programming,” Doctoral thesis, Technische Universität Berlin, Berlin, 2022. [Online]. Available: https://depositonce.tu-berlin.de/items/d6d22f30-37b3-42ed-afb9-be9c0aaa2931
[2] “HiGHS - Linear optimization software.” [Online]. Available: https://github.com/ERGO-Code/HiGHS

Results

MIPLIB2017, 10min, GH200

================================================================================
 main-2026-06-01 (1) vs simple-restart (2)
================================================================================

------------------------------------------------------------------------------------------------------------------------------
|                                        |       Run 1        |       Run 2        |     Abs. Diff.     |   Rel. Diff. (%)   |
------------------------------------------------------------------------------------------------------------------------------
| Imported                                                 240                  240                   +0                 --- |
| Feasible                                                 226                  228                   +2                 --- |
| Optimal                                                   83                   85                   +2                 --- |
| Solutions with <0.1% primal gap                          134                  141                   +7                 --- |
| Nodes explored (mean)                              1.329e+07            1.312e+07           -1.653e+05               -1.24 |
| Nodes explored (shifted geomean)                   1.426e+04            1.263e+04                -1632               -11.4 |
| Relative MIP gap (mean)                               0.2893                0.294             +0.00466               +1.61 |
| Relative MIP gap (shifted geomean)                   0.09705              0.09479            -0.002263               -2.33 |
| Solve time (mean)                                      426.6                421.2               -5.444               -1.28 |
| Solve time (shifted geomean)                           202.4                192.5               -9.887               -4.88 |
| Primal gap (mean)                                      11.31                10.48              -0.8358               -7.39 |
| Primal gap (shifted geomean)                          0.5319                0.489             -0.04293               -8.07 |
| Primal integral (mean)                                 32.45                31.87              -0.5818               -1.79 |
| Primal integral (shifted geomean)                      6.582                6.827              +0.2442               +3.71 |
------------------------------------------------------------------------------------------------------------------------------

Overall, I see 3 additional instances solved to optimality (unitcal_7, glass-sc, rococoC10-001000), but the solver no longer proves optimality for neos-5093327-huahum, for a net gain of 2.
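For readers unfamiliar with the "shifted geomean" rows in the table above, here is a small sketch of how such a statistic is commonly computed. The shift value used by the benchmark script is not stated in this PR, so the `shift` parameter is an assumption left to the caller.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Shifted geometric mean: exp(mean(log(v_i + shift))) - shift.
// The shift dampens the influence of very small values.
double shifted_geomean(const std::vector<double>& values, double shift)
{
  assert(!values.empty());
  double log_sum = 0.0;
  for (double v : values) {
    log_sum += std::log(v + shift);
  }
  return std::exp(log_sum / static_cast<double>(values.size())) - shift;
}
```

With a zero shift this reduces to the plain geometric mean, so `shifted_geomean({1.0, 100.0}, 0.0)` is 10 up to rounding. A larger shift pulls the result towards the arithmetic mean.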

Checklist

  • I am familiar with the Contributing Guidelines.
  • Testing
    • New or existing tests cover these changes
    • Added tests
    • Created an issue to follow-up
    • NA
  • Documentation
    • The documentation is up to date with these changes
    • Added new documentation
    • NA

nguidotti added 25 commits June 1, 2026 11:05
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
…an method.

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
…ogs to use the std::format variant.

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
…ly with restarts.

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
…restart

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
nguidotti added this to the 26.08 milestone Jun 10, 2026
nguidotti requested a review from a team as a code owner June 10, 2026 09:13
nguidotti added the non-breaking label (Introduces a non-breaking change) Jun 10, 2026
nguidotti requested a review from mlubin June 10, 2026 09:13
nguidotti added the improvement label (Improves an existing functionality) Jun 10, 2026
nguidotti requested a review from kaatish June 10, 2026 09:13
nguidotti marked this pull request as draft June 10, 2026 09:14
@copy-pr-bot

copy-pr-bot Bot commented Jun 10, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

nguidotti requested review from chris-maes and removed request for kaatish June 10, 2026 09:14
@coderabbitai

coderabbitai Bot commented Jun 10, 2026

📝 Walkthrough

This pull request implements a restart mechanism for branch-and-bound MIP solving. It extracts search tree management into a dedicated header, adds restart-triggered exploration loops, integrates concurrent halt signaling across workers and heuristics, and implements worker state reset for efficient restarts without full reallocation.
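The "worker state reset ... without full reallocation" idea can be sketched as follows. The field names echo ones mentioned elsewhere in this review, but the struct layout, the number of diving strategies, and the use of `std::vector` for the queue are assumptions for illustration, not the PR's actual `bfs_worker_t`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical worker; the layout is assumed, not taken from the PR.
struct toy_worker_t {
  static constexpr std::size_t num_strategies = 4;  // assumed number of diving strategies

  std::vector<int> node_queue;                               // pending node ids
  std::array<int, num_strategies> active_diving_workers{};   // per-strategy counts
  int total_active_diving_workers = 0;
  bool is_active                  = false;
  double lower_bound              = -std::numeric_limits<double>::infinity();

  // Return the worker to its initial state while keeping allocated storage.
  void reset_state()
  {
    node_queue.clear();             // drops the elements but keeps the capacity
    active_diving_workers.fill(0);  // clear the per-strategy counts as well
    total_active_diving_workers = 0;
    is_active                   = false;
    lower_bound                 = -std::numeric_limits<double>::infinity();
    assert(node_queue.empty());
  }
};
```

Because `clear()` leaves the vector's capacity untouched, the next exploration pass can refill the queue without new allocations, which is the point of resetting rather than reconstructing the worker.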

Changes

Restart-driven branch-and-bound exploration and integration

| Layer | File(s) | Summary |
|---|---|---|
| Search tree refactoring into dedicated header | `cpp/src/branch_and_bound/mip_node.hpp`, `cpp/src/branch_and_bound/search_tree.hpp`, `cpp/src/branch_and_bound/branch_and_bound.hpp` | `search_tree_t<i_t, f_t>` class is extracted from `mip_node.hpp` into new `search_tree.hpp` with thread-safe update/branch/clean methods for concurrent node management; `mip_node` move operations change from `noexcept` to allow move flexibility. |
| Restart infrastructure: status, constructor, and member fields | `cpp/src/branch_and_bound/branch_and_bound.hpp`, `cpp/src/branch_and_bound/branch_and_bound.cpp` | `mip_status_t` gains `RESTART = 8` enum value; `branch_and_bound_t` constructor accepts `std::atomic<int>* restart_concurrent_halt` parameter and stores it alongside new `restart_count_` member field; header includes `search_tree.hpp`. |
| Restart solver settings and stats tracking | `cpp/src/dual_simplex/simplex_solver_settings.hpp`, `cpp/src/branch_and_bound/worker.hpp` | `simplex_solver_settings_t` adds seven restart-related configuration members (min nodes, thresholds, frequencies, max restarts); `branch_and_bound_stats_t` adds `total_nodes_explored` counter and four restart checkpoint fields. |
| Restart detection and decision logic | `cpp/src/branch_and_bound/branch_and_bound.cpp` | New `should_restart(current_abs_gap)` method estimates restart tree size using explored-node growth, progress tracking, and gap reduction to decide whether to request a restart. |
| Worker and node queue reset mechanisms | `cpp/src/branch_and_bound/worker.hpp`, `cpp/src/branch_and_bound/worker_pool.hpp`, `cpp/src/branch_and_bound/node_queue.hpp` | `bfs_worker_t` and `diving_worker_t` gain `reset_state()` methods; `worker_pool_t` gains `reset()` method to reinitialize all workers under mutex; `node_queue_t` and `heap_t` gain `clear()` methods. |
| Pseudo-cost concurrent halt wiring | `cpp/src/branch_and_bound/pseudo_costs.hpp`, `cpp/src/branch_and_bound/pseudo_costs.cpp` | `pseudo_costs_t` adds `concurrent_halt_` member pointer; `trial_branching` signature extended with `std::atomic<int>* concurrent_halt` parameter; separate `pdlp_concurrent_halt` atomic introduced for Batch PDLP independent halt control in reliability branching. |
| Restart checking and halting in plunge exploration | `cpp/src/branch_and_bound/branch_and_bound.cpp` | Worker 0 periodically invokes `should_restart()` during plunge; on trigger sets `solver_status_ = RESTART`, raises concurrent halt flags, runs heuristic repair, and breaks; exploration accounting uses aggregate `exploration_stats_.total_nodes_explored`. |
| Main solve loop restart loop refactoring | `cpp/src/branch_and_bound/branch_and_bound.cpp` | Core `solve()` exploration refactored into `do { ... } while (solver_status_ == RESTART)` loop; each iteration reinitializes workers, reapplies bound-strengthening/symmetry, resets tracking fields, rebuilds search tree, and launches exploration; uses `search_tree_.clean()` and `worker_pool.reset()` between restarts. |
| Worker loop restart halt checks and instrumentation | `cpp/src/branch_and_bound/branch_and_bound.cpp` | BFS and diving worker loops check `*restart_concurrent_halt_ == 1` and `solver_status_ == RESTART` to exit promptly; `nvtx::range` instrumentation added to launch/stealing/diving/LP/cut pass functions. |
| Dual simplex solve restart halt integration | `cpp/src/dual_simplex/solve.cpp` | `solve()` and `solve_mip_with_guess()` create local `std::atomic<int> restart_concurrent_halt` and pass pointer to `branch_and_bound_t` constructor for MIP paths. |
| Heuristics solver integration of restart halt | `cpp/src/mip_heuristics/solver_context.cuh`, `cpp/src/mip_heuristics/solver.cu`, `cpp/src/mip_heuristics/diversity/lns/rins.cu`, `cpp/src/mip_heuristics/diversity/recombiners/sub_mip.cuh` | `mip_solver_context_t` adds `restart_concurrent_halt` member; RINS, `sub_mip_recombiner`, and `mip_solver` pass `&context.restart_concurrent_halt` to `branch_and_bound_t` constructor; RINS wires separate FJ preemption halt. |
| FJ CPU preemption flag pointer migration | `cpp/src/mip_heuristics/feasibility_jump/fj_cpu.cuh`, `cpp/src/mip_heuristics/feasibility_jump/fj_cpu.cu` | `fj_cpu_climber_t::preemption_flag` changes from reference to pointer; constructor stores address; loop checks use `->load()` and halt signals use `->store()`. |
| Symmetry generator sizing fix | `cpp/src/branch_and_bound/symmetry.hpp` | `orbital_fixing_t::reset()` resizes `surviving_generators_` using current `symmetry->num_generators` instead of cached `max_generators_`. |
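The restart loop and the shared halt flag described in the summary can be sketched in a few lines. This is a single-threaded toy under stated assumptions: `toy_solver_t`, `explore_once()`, and `max_restarts` are hypothetical stand-ins, and only the `do { ... } while (status == RESTART)` control flow and the `std::atomic<int>*` flag mirror what the summary describes.

```cpp
#include <atomic>
#include <cassert>

enum class toy_status_t { OPTIMAL, RESTART };

// Hypothetical solver skeleton; only the control flow mirrors the summary.
struct toy_solver_t {
  std::atomic<int>* restart_halt;  // shared flag polled by workers and heuristics
  int max_restarts;                // assumed cap on the number of restarts
  int restart_count = 0;
  int passes        = 0;

  // Stand-in for one exploration pass: pretend stagnation is detected
  // until the restart budget is used up.
  toy_status_t explore_once()
  {
    ++passes;
    if (restart_count < max_restarts) {
      restart_halt->store(1);  // ask every task to stop
      return toy_status_t::RESTART;
    }
    return toy_status_t::OPTIMAL;
  }

  toy_status_t solve()
  {
    assert(restart_halt != nullptr);
    toy_status_t status;
    do {
      restart_halt->store(0);  // clear the flag before (re)launching the tasks
      status = explore_once();
      if (status == toy_status_t::RESTART) {
        ++restart_count;  // a real solver would clean the tree and reset workers here
      }
    } while (status == toy_status_t::RESTART);
    return status;
  }
};
```

Clearing the flag at the top of each iteration keeps a stale halt request from the previous pass from stopping freshly launched tasks. With `max_restarts = 2` the toy performs three passes before returning `OPTIMAL`.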

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 4.55% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'Simple B&B restart' clearly and concisely describes the main feature added in this PR: a restart mechanism for branch-and-bound optimization. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR description clearly explains the restart mechanism, its objectives (stagnation detection, restart triggering, stopping tasks, cost fixing, state cleanup), and provides benchmark results. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/src/mip_heuristics/diversity/lns/rins.cu (1)

217-232: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Make the CPUFJ child task shutdown exception-safe.

fj_cpu->preemption_flag is rebound to a stack-local fj_halt_flag, then cpufj_solve() is launched as a child OpenMP task. The stop path only happens later on the normal path. If any call in between throws, the task can keep polling a dead stack atomic while fj_cpu is being unwound. Please move the halt flag/task ownership behind an RAII guard or a task object with stable lifetime so stop/join happens on every exit path.

As per coding guidelines, **/*.{cpp,cu,hpp,cuh}: Use RAII for all resource management including exception paths in C++.

Also applies to: 268-307
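One way to picture the RAII suggestion above is a small scope guard that raises the halt flag and waits for the child task on every exit path. This is a sketch under assumptions: `halt_guard_t` and `run_guarded` are hypothetical names, the flag is taken to be a `std::atomic<bool>`, and the join step is passed in as a callback, which in the OpenMP setting would wait for the child task.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <stdexcept>
#include <utility>

// Hypothetical guard: signals halt and joins the task when the scope ends,
// whether it ends normally or through an exception.
class halt_guard_t {
 public:
  halt_guard_t(std::atomic<bool>& flag, std::function<void()> join)
    : flag_(flag), join_(std::move(join))
  {
  }
  ~halt_guard_t()
  {
    flag_.store(true);      // ask the child task to stop
    if (join_) { join_(); } // wait for it; the callback must not throw
  }
  halt_guard_t(const halt_guard_t&)            = delete;
  halt_guard_t& operator=(const halt_guard_t&) = delete;

 private:
  std::atomic<bool>& flag_;
  std::function<void()> join_;
};

// Demonstration helper: returns true if the guarded region threw.
bool run_guarded(std::atomic<bool>& flag, bool& joined, bool should_throw)
{
  assert(!flag.load());  // the flag is expected to start cleared
  try {
    halt_guard_t guard(flag, [&joined] { joined = true; });
    if (should_throw) { throw std::runtime_error("failure between launch and stop"); }
    return false;
  } catch (const std::runtime_error&) {
    return true;
  }
}
```

The guard must be declared after the flag it references so that it is destroyed first. That ordering is what keeps the task from polling a flag that has already gone out of scope.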

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/mip_heuristics/diversity/lns/rins.cu` around lines 217 - 232, The
child OpenMP task binds fj_cpu->preemption_flag to the stack-local fj_halt_flag
which can dangle if an exception is thrown before the task completes; fix by
giving the task a stable, RAII-managed flag and ownership so it is valid for the
task's lifetime: allocate the halt flag with shared lifetime (e.g., a
heap/shared_ptr<atomic<bool>> or embed it inside a RAII task object) and set
fj_cpu->preemption_flag to that stable address, ensure the RAII object owns
joining/stopping the task and is destroyed on all exit paths; update the code
paths around fj_cpu, fj_halt_flag, and cpufj_solve to use that RAII-managed flag
so preemption_flag never points to stack memory.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/src/branch_and_bound/branch_and_bound.cpp`:
- Around line 743-745: The logged/exported counter was switched to
exploration_stats_.total_nodes_explored but solve_node_deterministic() only
increments exploration_stats_.nodes_explored, so deterministic runs report zero;
update the producer-side accounting inside solve_node_deterministic() (or the
deterministic node-visit path) to also increment
exploration_stats_.total_nodes_explored (or increment both
exploration_stats_.nodes_explored and exploration_stats_.total_nodes_explored)
wherever nodes are counted, ensuring the same counter used by the log/return
paths is updated for deterministic and opportunistic solves.
- Around line 1563-1572: The debug call in branch_and_bound.cpp using
settings_.log.debug mixes printf-style format specifiers with a "{}" placeholder
so nodes_since_last_check is never printed; change the "{}" to the appropriate
printf specifier (e.g., "%d") so the format string matches the arguments
(num_nodes, current_progress, current_abs_gap, nodes_since_last_check,
progress_since_last_check, gap_reduction, tree_size_estimate) and ensure the
order of placeholders aligns with those variables.
- Around line 2982-2985: The restart checkpoint mixes internal and user-space
units: set exploration_stats_.restart_gap_at_last_check using user-space units
(call compute_user_abs_gap() or otherwise convert upper_bound_ -
get_lower_bound() into user-space) so it matches what should_restart() compares;
update the initialization in the same block (replace upper_bound_ -
get_lower_bound() with compute_user_abs_gap()) to keep units consistent with
should_restart() and compute_user_abs_gap().

In `@cpp/src/branch_and_bound/worker.hpp`:
- Around line 230-237: bfs_worker_t::reset_state() currently clears node_queue
and zeroes total_active_diving_workers but does not reset the per-strategy
active_diving_workers array, leaving stale counts; modify reset_state() to also
zero all entries of active_diving_workers (e.g., fill or loop over
active_diving_workers to set each element to 0) so per-strategy state is
consistent after restart, while keeping the existing resets for
total_max_diving_workers, total_active_diving_workers, is_active, and
lower_bound.

---

Outside diff comments:
In `@cpp/src/mip_heuristics/diversity/lns/rins.cu`:
- Around line 217-232: The child OpenMP task binds fj_cpu->preemption_flag to
the stack-local fj_halt_flag which can dangle if an exception is thrown before
the task completes; fix by giving the task a stable, RAII-managed flag and
ownership so it is valid for the task's lifetime: allocate the halt flag with
shared lifetime (e.g., a heap/shared_ptr<atomic<bool>> or embed it inside a RAII
task object) and set fj_cpu->preemption_flag to that stable address, ensure the
RAII object owns joining/stopping the task and is destroyed on all exit paths;
update the code paths around fj_cpu, fj_halt_flag, and cpufj_solve to use that
RAII-managed flag so preemption_flag never points to stack memory.
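The format-string finding above (printf-style specifiers mixed with a `{}` placeholder) comes down to keeping one specifier per argument. The sketch below uses `std::snprintf` as a stand-in for the solver's printf-style logger. The helper name and the argument types are assumptions, so the specifiers would need to match the real variable types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper; the argument types are assumed, not taken from the PR.
std::string format_restart_debug(long num_nodes,
                                 double progress,
                                 double abs_gap,
                                 long nodes_since_last_check)
{
  char buf[160];
  // Every argument has a matching printf specifier, in the same order.
  const int written = std::snprintf(buf,
                                    sizeof(buf),
                                    "nodes=%ld progress=%.3f gap=%.3e nodes_since_last_check=%ld",
                                    num_nodes,
                                    progress,
                                    abs_gap,
                                    nodes_since_last_check);
  assert(written > 0 && static_cast<std::size_t>(written) < sizeof(buf));
  return std::string(buf);
}
```

A `{}` in a printf-style format is printed literally and consumes no argument, so every later argument shifts by one position. That is why the mismatch is easy to miss in the log output.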
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: baf4c7cc-54ed-4aa7-8328-a5f69d4d0b1f

📥 Commits

Reviewing files that changed from the base of the PR and between dcb97b7 and 4ea3cc2.

📒 Files selected for processing (18)
  • cpp/src/branch_and_bound/branch_and_bound.cpp
  • cpp/src/branch_and_bound/branch_and_bound.hpp
  • cpp/src/branch_and_bound/mip_node.hpp
  • cpp/src/branch_and_bound/node_queue.hpp
  • cpp/src/branch_and_bound/pseudo_costs.cpp
  • cpp/src/branch_and_bound/pseudo_costs.hpp
  • cpp/src/branch_and_bound/search_tree.hpp
  • cpp/src/branch_and_bound/symmetry.hpp
  • cpp/src/branch_and_bound/worker.hpp
  • cpp/src/branch_and_bound/worker_pool.hpp
  • cpp/src/dual_simplex/simplex_solver_settings.hpp
  • cpp/src/dual_simplex/solve.cpp
  • cpp/src/mip_heuristics/diversity/lns/rins.cu
  • cpp/src/mip_heuristics/diversity/recombiners/sub_mip.cuh
  • cpp/src/mip_heuristics/feasibility_jump/fj_cpu.cu
  • cpp/src/mip_heuristics/feasibility_jump/fj_cpu.cuh
  • cpp/src/mip_heuristics/solver.cu
  • cpp/src/mip_heuristics/solver_context.cuh

Comment thread cpp/src/branch_and_bound/branch_and_bound.cpp
Comment thread cpp/src/branch_and_bound/branch_and_bound.cpp Outdated
Comment thread cpp/src/branch_and_bound/branch_and_bound.cpp Outdated
Comment thread cpp/src/branch_and_bound/worker.hpp
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
@nguidotti

Contributor Author

/ok to test 3037981

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Labels

improvement (Improves an existing functionality), mip, non-breaking (Introduces a non-breaking change), P0
